ABSTRACT

The interrogation of genetic markers in environmental meta-barcoding
studies is currently seriously hindered by the lack of taxonomically
curated reference data sets for the targeted genes. The Protist
Ribosomal Reference database (PR2, http://ssu-rrna.org/) provides
unique access to eukaryotic small sub-unit (SSU) ribosomal RNA and DNA
sequences, with curated taxonomy. The database mainly consists of
nuclear-encoded protistan sequences. However, metazoans, land plants,
macrosporic fungi and eukaryotic organelles (mitochondrion, plastid and
others) are also included because they are useful for the analysis of
high-throughput sequencing data sets. Introns and putative chimeric
sequences have also been carefully checked. The taxonomic assignation
of each sequence consists of eight unique taxonomic fields. In total,
136 866 sequences are nuclear encoded, 45 708 (36 051 mitochondrial and
9657 chloroplastic) are from organelles, the remainder being putative
chimeric sequences. The website allows users to download sequences
from the entire database or from parts of it (including representative
sequences after clustering at a given level of similarity). Different
web tools also allow searches by sequence similarity. The presence of
both rRNA and rDNA sequences, taking into account introns (crucial for
eukaryotic sequences), a normalized eight-term ranked taxonomy and
updates following new GenBank releases were made possible by a
long-term collaboration between experts in taxonomy and computer
scientists.

INTRODUCTION 

The modern definition of the term 'protist' refers to unicellular
eukaryotes that are either free-living or parasitic, sometimes forming
colonies, but without clear differentiation into tissues. This
includes all eukaryotes other than land plants (and macro-algae),
animals and fungi with differentiated tissues. Protists are notoriously
paraphyletic and include a wide range of microorganisms using a huge
variety of reproductive, nutritional and life-history strategies.
Nevertheless, the term protist has pragmatic uses and has recently
gained in popularity. Large-scale analysis of protistan diversity is
complicated by their heterogeneity, which reflects their extremely
broad distribution and involvement in multiple ecological and
functional processes. This difficulty is exacerbated by the following
facts: (i) species delineation is often obscure owing to the lack of
clear morphological criteria and the paucity of knowledge concerning
processes of sexual recombination; (ii) the taxonomy of protists has
been radically modified in recent decades in light of new phylogenetic
data; and (iii) a large proportion of protists are probably still
uncultivable or as yet unknown.

Molecular barcoding using SSU rRNA (small sub-unit ribosomal RNA) gene
sequences has consequently become extremely popular among
protistologists. Environmental barcoding has unveiled an extensive
genetic diversity of protists in a wide range of ecosystems (1,2),
including lineages known only by their genetic signatures (orphan
environmental sequences). Recently, the use of next generation
sequencing (NGS) technologies targeting selected domains of the SSU
rRNA gene has permitted ecological studies of complex assemblages at
ever increasing scales (3-7). However, interpretation of such data is
currently seriously hindered by the lack of taxonomically curated
reference data sets. Unassigned and incorrectly assigned sequences are
accumulating at an increasing and alarming rate in public databases, to
the extent that in early 2012, almost 20% of submitted eukaryotic SSU
rRNA gene sequences had no taxonomic assignation or a very poor one
(see the website for more details). Undetected chimeric sequences (8),
as well as the presence of introns in gene sequences (9), are also
problematic.



To facilitate NGS data set analyses and increase their efficiency and
accuracy, we here present the first comprehensive curated database that
places eukaryotic SSU rRNA gene sequences within a coherent ranked
taxonomic framework covering eukaryotic diversity. Every sequence was
quality checked and annotated using a multi-level taxonomic
assignation. As many protists are still known only from their
environmental sequences, cluster names were retained when the formal
taxonomy was missing [such as Syndiniales (10) and MArine
STramenopiles, MAST (11)]. Although curated in less detail, sequences
from metazoa, land plants and macrosporic fungi, as well as eukaryotic
organelles (mitochondria, plastids, etc.), are also included in the
database for their ecological interest. Protists may live in close
association with metazoans (commensalism, symbiosis, etc.), and very
small metazoans exist that inhabit similar ecological niches: copepods
and polychaetes, as well as benthic animal larvae, coexist with
planktonic protists in aquatic systems. Such metazoans may also be of
great interest in ecological studies (as predators, for example), even
for protistologists. Although this database is dedicated to protists,
such outgroup sequences are highly relevant for separating out these
groups in further analyses of NGS data sets when 'universal' eukaryotic
primers are used for polymerase chain reaction (PCR) amplification.
Having metazoan sequences in PR2 prevents them from being wrongly
identified as new deep lineages of protists.



MATERIALS AND METHODS 

The construction of this database started >10 years ago, and our
procedure has been optimized over time (the recent history is detailed
at http://ssu-rrna.org/method.html). Here, we briefly describe the
present general architecture of the database.

Entries containing at least one partial SSU rRNA gene sequence of
eukaryotic origin are retrieved from three public databases using
keywords. Our last update retrieved 484 657, 496 462 and 123 such
entries from GenBank, EMBL and WGS-EMBL, respectively. An INSDC
(http://www.insdc.org/) entry, as defined by its accession number in
public databases, may contain several rRNA gene sequences, e.g. in long
genomic fragments containing several partial or complete ribosomal
operons. To allow for several sequences within a single entry, each
sequence was given a unique identifier, acc.p1.p2, where acc is the
accession number of the entry containing the sequence, and p1 and p2
are the first and last positions of the sub-sequence within the
complete sequence.

A majority of extracted sequences were shorter than 100 
nucleotides or around 500 nucleotides (63% of retrieved 
sequences), likely resulting from the recent integration of 
short environmental sequences derived from clone libraries. 
Only sequences longer than 799 nt were considered. 

The first step was the identification of sequences originating from
organelles. A reference database of SSU-rRNA gene sequences from
chloroplasts and mitochondria was constructed using entire genomes or
genomic fragments that contained an SSU-rRNA gene sequence and a
protein-coding gene specific to either mitochondria or chloroplasts.
For derived organelles such as apicoplasts, hydrogenosomes and
nucleomorphs, databases were built manually using information found in
scientific publications. These databases were used to determine, by
sequence similarity, the origin of every sequence in the database.
Organelle sequences were assigned to a reduced taxonomic framework that
includes their location (such as |Organelle|chloro-SSU| or
|Organelle|mito-SSU|) and are not annotated in further detail in the
database.

Introns were found to be a major problem in eukaryotic rRNA sequences
compared with prokaryotic sequences (1536 sequences with described
intron(s), 10 644 sequences with introns found by computation). A
dedicated C++ program was developed to identify the presence of introns
in the remaining sequences (9). When introns were detected, sequences
with and without the intron(s) were generated (rDNA and rRNA sequences,
respectively).

Sequences in the PR2 database are assigned an identifier of the form
accession.p1.p2_X, where accession is the accession number of an entry,
p1 and p2 are the positions of this sequence in a larger genomic entry
and X describes how introns were treated [X = G: genomic sequence
containing a described intron (rDNA); X = R: the previous genomic
sequence without the intron(s) (rRNA); X = U: no intron described, but
intron(s) may be present; X = UC: introns were detected in silico and
removed from the sequence (putative rRNA)].
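
As an illustration, an identifier of this form could be parsed as
follows. This is a minimal Python sketch and not part of the PR2 code
base: the names SeqId, parse_seq_id and INTRON_CODES, as well as the
regular expression itself, are assumptions made for the example.

```python
import re
from typing import NamedTuple

# Meaning of the trailing intron code X, as listed in the text above.
INTRON_CODES = {
    "G": "genomic sequence containing a described intron (rDNA)",
    "R": "previous genomic sequence without the intron(s) (rRNA)",
    "U": "no intron described, but intron(s) may be present",
    "UC": "introns detected in silico and removed (putative rRNA)",
}


class SeqId(NamedTuple):
    accession: str
    start: int        # p1, first position within the entry
    end: int          # p2, last position within the entry
    intron_code: str  # one of the keys of INTRON_CODES


# Assumed pattern: accession, two integer positions, then _G, _R, _U or _UC.
_ID_RE = re.compile(r"^([A-Za-z0-9_]+)\.(\d+)\.(\d+)_(UC|G|R|U)$")


def parse_seq_id(identifier: str) -> SeqId:
    """Split 'accession.p1.p2_X' into its four components."""
    match = _ID_RE.match(identifier.strip())
    if match is None:
        raise ValueError(f"not an accession.p1.p2_X identifier: {identifier!r}")
    acc, p1, p2, code = match.groups()
    return SeqId(acc, int(p1), int(p2), code)


seq_id = parse_seq_id("EF023694.1.1975_U")
length = seq_id.end - seq_id.start + 1  # length of the sub-sequence in nt
print(seq_id.accession, length, INTRON_CODES[seq_id.intron_code])
```

Keeping the positions as integers makes it straightforward to recover
the sub-sequence length or to group several sequences that come from
the same entry.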

Taxonomy of nuclear-encoded sequences 

As all SSU-rRNA genes are orthologs, a global phylogeny can be built,
and essential past speciation events can be inferred. This property is
essential for building a ranked taxonomy. For example, at rank 1, there
is worldwide agreement on recognizing three clades: Bacteria, Archaea
and Eukaryota. We chose to use 'Organelle' as an additional rank 1
term. Organelles have a eukaryotic origin when they are nucleomorphs
and a bacterial origin when they are mitochondria or plastids. Because
organelles and their hosts have evolved differently over time, their
taxonomies differ too. In addition, scientists working on diversity are
more interested in identifying the cells that bear such organelles. Our
choice was thus to allow their easy identification (and filtering out)
during the first step of an analysis by labelling them as 'Organelle'
at rank 1.

The nomenclature and terms of the following ranks mainly follow the
classification of eukaryotes proposed by Adl et al. (12). Thus, the
second rank describes each eukaryotic 'Super-Group' or Phylum (both
terms are in use in different communities): Alveolata, Amoebozoa,
Apusozoa, Archaeplastida, Excavata, Opisthokonta, Rhizaria or
Stramenopiles. The taxonomic descriptions are structured into eight
ranks, and the ranks after the second mainly correspond to the
division, class, order, family, genus and species.

The terms used for each rank are unambiguous (a term cannot be found in
two different clades), contain no space (which may pose problems for
computer programs) and, whenever possible, were retained only if
monophyletic. When monophyly could not be ensured, the term of the rank
above was used, appended with the suffix _X (or with a further X if the
term of the rank above already ended in _X). As the same species name
frequently occurs in different genera, the species name is composed of
the genus and species, using '+' as a separator (e.g. genus = Diderma,
species = Diderma+niveum). Genus and species names from public
databases are stored in separate fields for comparison.
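
The naming rules above can be sketched in a few lines of Python. This
is only an illustration under stated assumptions: the function names
are hypothetical, the rank terms 'DivisionA', 'ClassA' and 'OrderA' are
placeholders rather than real PR2 terms, and the '+sp.' ending for
undescribed species follows the pattern given later in the RESULTS
section.

```python
import re
from typing import List, Optional


def placeholder(parent: str) -> str:
    """Term for an unresolved rank, derived from the term of the rank above.

    'Order' -> 'Order_X', and 'Order_X' -> 'Order_XX'.
    """
    return parent + "X" if re.search(r"_X+$", parent) else parent + "_X"


def normalize(terms: List[Optional[str]]) -> List[str]:
    """Return eight space-free terms; None marks a missing rank.

    The last element is the species epithet only (e.g. 'niveum'); it is
    joined to the genus term with '+', as described in the text.
    """
    if len(terms) != 8 or not terms[0]:
        raise ValueError("expected eight ranks with a known rank 1")
    out = [terms[0]]
    for term in terms[1:7]:
        out.append(term if term else placeholder(out[-1]))
    epithet = terms[7] if terms[7] else "sp."
    out.append(f"{out[6]}+{epithet}")
    return out


# Family (rank 6) unresolved, genus and species known.
print("|".join(normalize(
    ["Eukaryota", "Amoebozoa", "DivisionA", "ClassA", "OrderA",
     None, "Diderma", "niveum"])))
```

Because every record ends up with exactly eight terms, two sequences
can be compared rank by rank without any alignment of their taxonomic
paths.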

For protists and unicellular fungi, a taxonomy was proposed by the
group of experts who authored this article. For multicellular fungi,
plants and metazoans, the taxonomy was built mostly from the taxonomy
assigned in the National Center for Biotechnology Information (NCBI)
GenBank database entries. We first built a core reference database
containing 23 116 manually analysed sequences representative of
eukaryotic diversity. These analyses included reading published
articles and, when necessary, phylogenetic analyses done by the authors
of this article. This core reference database was subsequently used to
annotate the remaining sequences automatically using different methods.

We are aware that for some clades such as metazoa, plants and fungi,
our eight-term taxonomy is probably not as precise as it should be.
SSU-rRNA sequences are not often used for barcoding metazoa and plants
(normally only to complement Internal Transcribed Spacer (ITS)
sequences). We will therefore try, in a forthcoming release, to propose
an extended, still ranked and unified, taxonomy for fungi.

An offshoot of PR2 is the web-based tool KeyDNAtools
(http://keydnatools.com/). It uses 159 982 specific short (15 nt)
oligonucleotide sequences (named keys) generated from the core
reference database. Each key is a signature present in sequences of a
given clade, but not in those of other clades. Besides providing very
fast taxonomic identification, it also allows putative chimeric
sequences to be detected, for instance when different identifications
are obtained from the 5' and 3' ends of a sequence.
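
The idea of clade-specific keys can be illustrated with a toy Python
sketch. It is not the KeyDNAtools implementation: the function names,
the majority vote used for identification, the split of the query into
two halves and the short key length (k = 4 instead of 15 nt) are all
assumptions chosen to keep the example small.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set


def kmers(seq: str, k: int) -> Set[str]:
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_keys(reference: Dict[str, List[str]], k: int) -> Dict[str, str]:
    """Map every k-mer that occurs in exactly one clade to that clade."""
    owners = defaultdict(set)
    for clade, sequences in reference.items():
        for seq in sequences:
            for kmer in kmers(seq, k):
                owners[kmer].add(clade)
    return {kmer: next(iter(clades))
            for kmer, clades in owners.items() if len(clades) == 1}


def identify(seq: str, keys: Dict[str, str], k: int) -> Optional[str]:
    """Clade supported by the largest number of keys found in seq."""
    votes = defaultdict(int)
    for i in range(len(seq) - k + 1):
        clade = keys.get(seq[i:i + k])
        if clade is not None:
            votes[clade] += 1
    return max(votes, key=votes.get) if votes else None


def looks_chimeric(seq: str, keys: Dict[str, str], k: int) -> bool:
    """Flag a sequence whose 5' and 3' halves point to different clades."""
    half = len(seq) // 2
    five_prime = identify(seq[:half], keys, k)
    three_prime = identify(seq[half:], keys, k)
    return None not in (five_prime, three_prime) and five_prime != three_prime


# Toy reference with two made-up clades.
reference = {"CladeA": ["AAAACCCCAAAACCCC"], "CladeB": ["GGGGTTTTGGGGTTTT"]}
keys = build_keys(reference, k=4)
print(identify("AAAACCCC", keys, k=4))
print(looks_chimeric("AAAACCCC" + "GGGGTTTT", keys, k=4))
```

Since identification only needs dictionary look-ups of fixed-length
words, no alignment is required, which is consistent with the speed
reported for the key-based approach.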

New dedicated computer programs, mostly in C, C++ and Python, have been
developed. The first is a parallel, distributed, Needleman-Wunsch-based
C program that computes pair-wise distances without taking into account
terminal gaps (partially overlapping sequences) or long internal gaps
(introns). It was coupled to a newly rewritten C average linkage
clustering program. The second is a parallel, distributed,
Needleman-Wunsch-based C++/Python program (Crunch_Assign) that assigns
a consensus taxonomy to new sequences by comparison with a reference
database.
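
The principle of a global alignment in which terminal gaps and long
internal gaps are not counted can be sketched as follows. This is a
simplified, single-threaded Python illustration and not the C program
described above: the scoring values, the linear gap penalty, the
threshold intron_min and both function names are assumptions made for
the example.

```python
from typing import List, Tuple


def align_overlap(a: str, b: str, match: int = 1, mismatch: int = -1,
                  gap: int = -1) -> List[Tuple[str, str]]:
    """Needleman-Wunsch alignment with free terminal gaps.

    Returns the aligned columns as (char_a, char_b); '-' marks a gap.
    Unaligned overhangs at either end are not returned.
    """
    n, m = len(a), len(b)
    # First row and column stay at 0, so leading overhangs cost nothing.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Trailing overhangs are free: end at the best cell of the last row or column.
    cells = [(score[n][j], n, j) for j in range(m + 1)]
    cells += [(score[i][m], i, m) for i in range(n + 1)]
    _, i, j = max(cells)
    columns = []
    while i > 0 and j > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + step:
            columns.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            columns.append((a[i - 1], "-"))
            i -= 1
        else:
            columns.append(("-", b[j - 1]))
            j -= 1
    columns.reverse()
    return columns


def similarity(a: str, b: str, intron_min: int = 10) -> float:
    """Fraction of identical columns, skipping gap runs >= intron_min."""
    columns = align_overlap(a, b)
    matches = compared = 0
    k = 0
    while k < len(columns):
        x, y = columns[k]
        if x == "-" or y == "-":
            side = 0 if x == "-" else 1
            run = k
            while run < len(columns) and columns[run][side] == "-":
                run += 1
            if run - k < intron_min:   # short gaps count as differences
                compared += run - k
            k = run
        else:
            compared += 1
            matches += x == y
            k += 1
    return matches / compared if compared else 0.0


with_intron = "ACGTACGTAC" + "G" * 12 + "TTGCATTGCA"
without_intron = "ACGTACGTAC" + "TTGCATTGCA"
print(similarity(with_intron, without_intron, intron_min=10))
```

With a linear gap penalty a very long intron may be misplaced by the
alignment itself; an affine or capped gap cost would be a natural
refinement, but it is left out here to keep the sketch short.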

When the different methods assigned conflicting taxonomies, the
conflict was resolved manually. In the end, each nuclear-encoded
sequence is assigned an identifier of the form shown in this example:

>AY827845.1.1765_U|Eukaryota|Apusozoa|Hilomonadea|Planomonadida|Planomonadidae|Planomonadidae_Group-1|Ancyromonas|Ancyromonas+sigmoides
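
A header of this form can be split into the sequence identifier and its
eight taxonomic terms, for instance as in the minimal Python sketch
below. The field names in RANK_NAMES are an assumption based on the
rank descriptions given above, and parse_header is a hypothetical
helper; organelle entries, which use a reduced taxonomy, would need
separate handling.

```python
from typing import Dict

# Assumed labels for the eight ranks, following the description in the text.
RANK_NAMES = ("domain", "super_group", "division", "class",
              "order", "family", "genus", "species")


def parse_header(header: str) -> Dict[str, str]:
    """Turn '>id|rank1|...|rank8' into a dictionary."""
    fields = [field.strip() for field in header.lstrip(">").split("|")]
    seq_id, terms = fields[0], fields[1:]
    if len(terms) != len(RANK_NAMES):
        raise ValueError(f"expected 8 taxonomic terms, found {len(terms)}")
    record = {"id": seq_id}
    record.update(zip(RANK_NAMES, terms))
    return record


header = (">AY827845.1.1765_U|Eukaryota|Apusozoa|Hilomonadea|Planomonadida|"
          "Planomonadidae|Planomonadidae_Group-1|Ancyromonas|"
          "Ancyromonas+sigmoides")
record = parse_header(header)
print(record["id"], record["super_group"], record["species"])
```

Writing record["id"] to a fasta file and the remaining fields to a
tab-separated file would give the short-identifier layout offered on
the download page.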

RESULTS 

In total, we found 136 866 nuclear-encoded sequences, five pseudo-genes
(FJ854546, FJ854545, D14632, AF310844, AJ404858, not included in PR2)
and 34 sequences that we could only assign as putative rRNA sequences
(HM538255, GU385678, AB275106, AJ628837, AY180011, CP000499, CP000499,
AY256215, EU402432, AB017015, GQ330639, GU820811, JF488788, AF239231,
DQ423737, DQ104596, AY835700, DQ423728, EU545797, GU072272, GU072526,
GQ247249, HM174255, DQ104594, EU174762, FN598473, EU726200, EF695080,
GQ483783, GQ462590, EU173354, EF567390, EF695215, HQ871039, not
included in PR2). Manual analyses of some of the latter revealed the
presence of artefactual sequence, either internal or at the 5' or 3'
end. Among nuclear-encoded sequences, we detected 1756 putative
chimeric sequences, using KeyDNAtools and/or manual inspection (listed
on the website). For example, sequence EF023694.1.1975_U is a chimera
between parent sequences of Opisthokonta, Amoebozoa and Rhizaria at
positions 179-471, 623-1264 and 1536-1925, respectively. Other '18S'
sequences are nucleomorphs (262 sequences). In all, 9657 sequences have
a chloroplastic origin, 36 051 are from mitochondria, six are from
hydrogenosomes (AJ237907, AJ237908, AJ871215, AJ871217, AJ871267,
Y16670) and 26 are from apicoplasts (U87145, AB471801, AB471802,
AB471803, AB471804, AB471805, AB471806, AB471807, AB471808, AB471809,
AB471810, AB471811, AB471812, AB649417, AB649418, AB649419, AB649420,
AB649421, AB649422, AB649423, AB649424, HQ110105, JQ437257, JQ437258,
JQ437259, U28056).

Within nuclear-encoded sequences, 54 data entries remained unassigned
at the Super-Group level (Table 1), meaning that they could not be
assigned to any specific taxonomic group within the domain Eukaryota
(Eukaryota_X). The Super-Group 'Eukaryota_Mikro' was created for
sequences HM563060, AF477623 and HM563061, for which no consensus has
been reached on their affiliation, although Haplosporidiidae has been
suggested (13). BLAST analyses conducted at NCBI against the
non-redundant database, or at the DNA Data Bank of Japan (DDBJ) against
all sequences, showed extremely weak sequence similarity with sequences
of fungi. Our global similarity tool (Crunch_Assign) found no other
sequence similar at >80% along the entire sequence. These results led
to the creation of this new Super-Group (rank 2). For unassigned
nuclear-encoded sequences (Eukaryota_X), either no other similar
sequence was found, or similar sequences were detected but were also
annotated by us as Eukaryota_X. A BLAST search against the NCBI
non-redundant database (excluding environmental sequences) and at DDBJ
(all sequences) revealed that a large number of them probably contained
undescribed introns. These sequences therefore probably require manual
curation, and they again highlight the importance of intron
identification in eukaryotic sequences.

For lower taxonomic ranks, there were primarily two 
types of cases resulting in a failure to assign a taxonomic 
identity: 

(1) No agreement between experts to resolve at a given rank. For
example, the genus (rank 7) is assigned, the order (rank 5) is
assigned, but a family (rank 6) has not yet been described, or this
rank is in fact polyphyletic, with no proper descriptions of the
different families.
(2) A given sequence is similar at the family level with several
sequences from different families; however, they agree at the order
level.

In such cases, this sequence was assigned as
...|Order|Order_X|Genus|Genus+species. If a genus was not described
(i.e. uncultured), the taxonomy becomes
...|Order|Order_X|Order_XX|Order_XX+sp.

Table 1. Number of nuclear-encoded sequences in PR2 as annotated at the
Super-Group taxonomic level

Super-Group                        n1         n2
Alveolata                          20 760     20 255
Amoebozoa                          1902       1880
Apusozoa                           254        242
Archaeplastida                     16 309     16 092
Eukaryota_Mikro                    3          3
Eukaryota_X                        54         54
Excavata                           2871       2869
Hacrobia                           2192       2132
Opisthokonta                       75 056     74 484
Rhizaria                           7581       7459
Stramenopiles                      9884       9640
Total nuclear-encoded Eukaryota    136 866    135 110
Apicoplast                         26         26
Chloroplast SSU                    9657       9657
Hydrogenosome SSU                  6          6
Mitochondrion SSU                  36 051     36 051
Nucleomorph SSU (18S)              264        262

n1, total number; n2, excluding putative chimera; Super-Group, rank 2
taxonomy.

More than 74 000 sequences (54% of the total number of sequences in the
PR2 database) belong to Opisthokonta (Figure 1). Alveolata and
Archaeplastida are second in abundance (15 and 12%, respectively).
Stramenopiles and Rhizaria represent 7.2 and 5.6%, respectively. The
other Super-Groups each represent <2.2%. Only 29.4% of sequences are
complete or nearly complete. In total, 63.7% of sequences include the
V4 region, and only 12.1% and 11.7% include the V9 region as recognized
by the Biomarks and WAMPS primers (see the legend of Figure 1),
respectively. Apusozoa, Hacrobia, Excavata and Opisthokonta have <10%
of their sequences that include the V9 region. The V9 regions of
Amoebozoa and Archaeplastida are better represented (34% and 25%,
respectively, using the Biomarks primers).

Figure 1. Total number of SSU rDNA gene sequences in the PR2 database
for each main eukaryotic lineage (all sequences = grey + black,
complete or nearly complete sequences in light-grey; horizontal axis:
number of sequences in PR2). Note that nucleomorphs were extracted from
Archaeplastida. Numbers after the bars indicate the percentages of
sequences that include the following: (i) the V4 region as defined by
primers forward CCAGCASCYGCGGTAATTCC and reverse ACTTTCGTTCTTGATYRA
used during the European Biomarks project; (ii) the V9 region as
defined by primers forward GTACACACCGCCCGTC and reverse
TGATCCTTCTGCAGGTTCACCTAC used during the European Biomarks project; and
(iii) the V9 region defined by primers forward TTGTACACACCGCCC and
reverse CCTTCYGCAGGTTCACCTAC used by the WAMPS project. For
Opisthokonta, number in white = total number of sequences.

DOWNLOADS

We provide several different ways of downloading the database or part
of it (see more explanations at
http://ssu-rrna.org/downloads_eukaryotic_main_page.html).

(1) The entire database or sequences of a specific clade can be
downloaded using a taxonomy browser in fasta format, with sequence
identifiers as described above. Putative chimeras have been removed.

(2) The entire database or sequences of major groups can be downloaded
in fasta format, with only the short unique identifier. The
corresponding taxonomy is then downloaded as a tabulated file. This
fasta format is appropriate for tools that do not allow long sequence
identifiers. Such files are also easier to use in large computations,
as they require less memory, and in pipelines or web sites (see below).

(3) The entire database, taxonomies and sequences in tabulated format,
for easy import into relational databases.

(4) The entire database or sequences of a specific clade in fasta
format, with sequence identifiers as described above, but after
clustering by sequence similarity (98, 96 or 92%) and choosing only the
longest sequence as the representative of each cluster.

(5) Phylogenetic trees are available for the main groups. They were
built using pair-wise distance computations (not counting introns as
differences, as explained above) and FastME (14).

(6) An 'arb' filter that allows a fasta file (with taxonomy in the
identifier) to be easily imported into an arb database, separating
sequences and taxonomy as required.

(7) In silico extracted domains corresponding to regions widely used in
published articles and to several primer pairs.
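
For the clustered downloads in item (4), the choice of a representative
can be sketched as follows. The Python snippet assumes that cluster
membership has already been computed (for example by average linkage
clustering at 98% similarity); the function name, the record layout and
the example identifiers are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def pick_representatives(
        records: Iterable[Tuple[str, str, int]]) -> Dict[int, Tuple[str, str]]:
    """Keep the longest sequence of each cluster.

    records: (sequence identifier, sequence, cluster number).
    On ties, the sequence seen first is kept.
    """
    best: Dict[int, Tuple[str, str]] = {}
    for seq_id, sequence, cluster in records:
        current = best.get(cluster)
        if current is None or len(sequence) > len(current[1]):
            best[cluster] = (seq_id, sequence)
    return best


# Made-up identifiers and sequences, for illustration only.
records = [
    ("seqA.1.8_U", "ACGTACGT", 1),
    ("seqB.1.12_U", "ACGTACGTACGT", 1),
    ("seqC.1.6_U", "TTGCAT", 2),
]
for cluster, (seq_id, sequence) in sorted(pick_representatives(records).items()):
    print(cluster, seq_id, len(sequence))
```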



SEARCHING THE DATABASE 

We provide the following additional kinds of tools:

(1) A search by keywords, which allows searching by taxonomy, accession
number and PMID (PubMed ID: retrieval of sequences described in a given
publication). Retrieved sequences can be filtered according to length,
quality and whether they contain the variable V4 or V9 domains (often
used in conjunction with deep sequencing).

(2) A search by 'sequence signature', with a link to the KeyDNAtools
website (http://keydnatools.com/). This tool provides very fast results
even for files containing many sequences. It also allows detection of
putative chimeras, as explained above.

(3) A BLAST search against the database, as found on most sites.

(4) A search (Crunch_Assign) using our modified global
(Needleman-Wunsch-based) algorithm, which returns the most similar hits
based on the alignment of the entire sequences rather than on a good
local alignment (high scoring pair, in BLAST). As a result, the
percentage of similarity computed agrees better with what would be
found by computing distances after a multiple sequence alignment
[Clustal (15), Muscle (16), MAFFT (17), ...]. Introns can optionally be
taken into account, as described above.

(5) A search for one or two primer motifs in sequences (a C program),
returning every sequence that contains the primer(s); International
Union of Pure and Applied Chemistry (IUPAC) encoding is allowed, as are
mismatches between primer and sequence.

(6) In silico extracted domains corresponding to regions widely used in
published articles and to several primer pairs.
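
The primer search of item (5) can be illustrated by a short Python
sketch that accepts IUPAC ambiguity codes and a bounded number of
mismatches. It is not the C program mentioned above; the function name
and the simple position-by-position scan are assumptions. The example
uses the V4 forward primer quoted in the legend of Figure 1 against a
made-up target sequence.

```python
from typing import List, Tuple

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def primer_hits(sequence: str, primer: str,
                max_mismatches: int = 0) -> List[Tuple[int, int]]:
    """Return (0-based start, number of mismatches) for every primer site."""
    sequence = sequence.upper().replace("U", "T")
    allowed = [IUPAC[base] for base in primer.upper()]
    hits = []
    for start in range(len(sequence) - len(allowed) + 1):
        mismatches = 0
        for offset, bases in enumerate(allowed):
            if sequence[start + offset] not in bases:
                mismatches += 1
                if mismatches > max_mismatches:
                    break
        else:  # the loop was not broken: the site is within the limit
            hits.append((start, mismatches))
    return hits


v4_forward = "CCAGCASCYGCGGTAATTCC"
target = "TTT" + "CCAGCAGCCGCGGTAATTCC" + "AAA"  # made-up flanks
print(primer_hits(target, v4_forward))
```

A reverse primer would be searched as its reverse complement on the
same strand; combining one forward and one reverse hit then delimits
the region to extract in silico.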

Both BLAST and Crunch_Assign similarity searches are coupled to
BLAST2Tree or Crunch_Assign2Tree, which use our Scriptree software
(18). Similarity search results can simply be copied and pasted into
the '2Tree section'; a phylogenetic tree is then built and displayed on
the fly, with taxonomic assignations (as chosen by the user) displayed
next to each leaf. This section also allows the pasted sequences and
their taxonomy to be downloaded as a tabulated file (19).

CONCLUSION AND PERSPECTIVES 

There are presently three databases, SILVA (20), RDP (21) and
GreenGenes (22), offering a curated taxonomy for prokaryotic SSU rRNA
sequences. Only SILVA additionally provides reference sequences for
SSU-rRNA sequences of eukaryotic origin, curated for sequence quality
but using the NCBI taxonomy (although a 'SILVA' taxonomy has recently
been proposed). Because our sequence identifier, i.e. accession.p1.p2,
is similar to that used by SILVA, both databases can be easily
compared.

Based on the last release (111), 1518 of the 71 787 eukaryotic SILVA
reference sequences are not present in the PR2 database. Manual checks
showed that these sequences were extracted from entries in which no
annotation identified the presence of an SSU-rRNA sequence, or which
were annotated as mRNA or as prokaryotes. In all, 670 sequences
identified as mitochondria were not in PR2; none of the SILVA
chloroplast sequences was absent from PR2. Missing sequences will soon
be analysed and incorporated in PR2. On the other hand, 53 735/7774
nuclear, 31 492/29 763 mitochondrial, 462/18 chloroplastic and 133/80
other organelle sequences present in PR2 were not in the SILVA
reference sequences and the entire SILVA database, respectively. This
can be largely explained by the drastic filtering steps used by SILVA
for both minimal length and sequence quality. However, because we are
also users of such databases to analyse NGS data sets, we identified
two major reasons not to filter too drastically on quality. First,
representatives of novel environmental clades are often found within
clone libraries with lengths of <1000 nt. Second, extreme quality
filters may remove important sequences that are representative of
environmental groups but are too short and/or of poor quality at one
end of the sequence (one-step Sanger sequencing without enough noise
treatment, for example). In PR2, sequence quality was indirectly
inferred from the quality of the taxonomic assignation because
bad-quality sequences were poorly assigned. Again, as sequence
identifiers are similar in the two databases, sequences can be easily
compared between them.

The PR2 database offers several valuable complementary tools and data
that other databases lack.

A ranked taxonomy 

Like the PR2 database, SILVA now offers a taxonomy for eukaryotes based
on the structure proposed by Adl et al. (12). However, in contrast to
SILVA, we propose a normalized eight-term ranked taxonomy for every
sequence in the database. We adopted this 'normalization' based on our
experience in dealing with very large data sets using automated
pipelines, and with a depth of sequencing that revealed organisms
spanning the entire spectrum of known living organisms. When
considering the NCBI taxonomy, for example, two sequences of
Perciformes were found described using 22 ranks (AY263842 and
EF470892), whereas another (AF112595) was described using only 15
ranks, and 10 360 sequences of Perciformes had between 16 and 21 ranks.
Numerous examples exist for protists. A very good example is the genus
Carpediemonas. NCBI classifies this genus within Eukaryota (rank 1),
Fornicata (rank 2), Carpediemonas (rank 3). However, sequence AY117416
(Carpediemonas membranifera, 23) has no rank 2 taxonomy in its entry.
As a result, given only the lists of terms provided by a non-ranked
taxonomy, it becomes extremely difficult for a computer to identify,
for two different sequences, which members of the two lists correspond
to the same rank. This is the problem solved by our ranked taxonomy,
built with the help of a worldwide panel of taxonomic experts. As an
example, the taxonomy of sequence AY117416 becomes
Eukaryota|Excavata|Metamonada|Fornicata|Fornicata_Group-2|Carpediomonas-like|Carpediemonas|Carpediemonas+membranifera
in PR2. In SILVA, this sequence is linked to a seven-term taxonomy, but
the taxonomy is seemingly not ranked and unified.

Missing ranks are automatically replaced in PR2 (labeled as clade-i_X,
where clade-i is the term for the next higher rank). This strategy
allows the taxonomy to be rapidly inferred at the most probable higher
rank and provides a rapid method for screening putative novel lineages
at each taxonomic level.

Introns 

Most SSU rRNA databases and biodiversity analyses of prokaryotes
understandably neglect introns. Although found even in Escherichia coli
(24,25), introns are rare in Bacteria and not very abundant in Archaea.
Even where present, they have not yet, to our knowledge, been described
in rRNA gene sequences. However, in Eukaryota, introns can be
relatively abundant in rRNA gene sequences, at least in some groups
(9). This led us to incorporate both the rRNA and the rDNA sequences in
our database. As most NGS (or clone library) analyses of biodiversity
rely on PCR amplification of extracted gDNA, introns may represent a
large part of the variability observed. Having genomic sequences in the
database, in addition to the rRNA transcript, is important not only for
searching by similarity but also for the in silico estimation of
expected amplicon lengths.

Organelles 

Organelles are often poorly treated in reference databases. For
hydrogenosomes (AJ237907, AJ871215, AJ871217, AJ871267, Y16670), only
sequence AJ871217 can be found in SILVA, labeled as 'Unclassified'. In
GreenGenes, the sequences were not found when searching by accession
number. At RDP, the classifier returned 'unclassified_Bacteria' in
every case. Of the 26 apicoplast sequences, none was found in the SILVA
reference sequences or in 'ssu-accession-parc.acs', release 111
(3 186 762 accession numbers). Even for better-known organelles,
taxonomic assignation is not much better. For example, sequence
AB000109, from the mitochondrion of Dictyostelium discoideum, is
labeled as 'Unclassified' in SILVA. Chloroplasts are generally well
identified in SILVA. However, among the chloroplastic sequences
detected in this study, 263 were found in SILVA reference sequences as
chloroplasts. Our approach of building independent databases for these
organelles probably allowed us to reach a more precise taxonomic
affiliation of organelles. Having such organelles of prokaryotic origin
in our database is essential for NGS data sets of both prokaryotes and
eukaryotes because the use of 'Bacteria'- or 'Eukaryota'-specific
primers has in some cases resulted in a significant proportion of
amplicons that are in fact of organelle origin (3-7). Even if organelle
sequences are simply discarded from the final analysis, this database
avoids identifying these sequences as new deep lineages.

Chimeric sequences 

Chimeric sequences are PCR-generated hybrid products between multiple
parent sequences that can be falsely interpreted as novel organisms,
thus inflating apparent diversity (8,26). The two algorithms most
widely used for 16S chimera detection are Pintail (27), included in the
RDP and SILVA databases, and Bellerophon (28), included in GreenGenes.
In both cases, chimeras are detected by comparing independent regions
of a sequence alignment. KeyDNAtools does not require prior alignment
of sequences, and it is particularly efficient at detecting complex
chimeras that have more than two parent sequences or that arise between
two closely related parents. This tool can be used in concert with
other detection methods. Our database, which has been screened for
putative chimeras, offers two download options: either including or
excluding putative chimeric sequences.

Similarity searches 

BLAST is a widely used tool that finds regions of local similarity
between sequences. However, a search based on a single good local high
scoring pair can give misleading results. We thus developed two
independent methods of assignation. The first, the Crunch_Assign
software, uses a Needleman-Wunsch algorithm. It is also faster than
BLAST and returns a score computed on the entire alignment. Because we
are working on eukaryotes, we also included the possibility of ignoring
putative introns (to our knowledge, this possibility is not included in
any other software). The second, KeyDNAtools, is also very fast and
additionally offers chimera detection, as discussed above. In >95% of
cases, both assignations provide similar results. Sequences not
annotated by KeyDNAtools likely result from the absence of the
corresponding clade in the core reference database, from low-quality
sequences or from novel variants of the gene present in newly available
sequences not yet included in the core data set. Conversely, sequences
not assigned by the Crunch_Assign software are often chimeras or
low-quality sequences. After a search by similarity, we offer the
possibility of building a phylogenetic tree on the fly, using the most
similar sequences found by BLAST or Crunch_Assign.

Updates 

We have developed a pipeline that allows a new GenBank release to be
analysed within a week. Most of the time is in fact spent on manual
checking of conflicts after average linkage clustering, as explained
previously. Updates of the PR2 database will therefore be made shortly
after each new GenBank release. As a result, the numbers provided in
this article will probably differ from those available from PR2 at the
time this manuscript is published.